The data set comes from a customer behavior survey published on Kaggle. It describes 400 clients of a company, including each client's unique ID, gender, age, and salary. (Data set source: https://www.kaggle.com/denisadutca/customer-behaviour)
Because the data provides limited background on each customer, this analysis focuses on the customers who made a purchase. The purpose is to collect information about purchasing decisions and analyze how the available factors relate to them. In this way, we can ask which factors affect people’s purchasing decisions: age, salary, or gender?
The analysis is based only on people who made a purchase, so I subset the data set by selecting “purchase=1”. I took the following preparation steps:
1. Check for null data in the data set: there is none.
2. Rename a column: “EstimatedSalary” becomes “Salary”.
3. Create the subset of customers who purchased.
4. Separate the data into age groups (18~24, 25~34, 35~44, 45~54, 55~60) using the breaks and counts from hist.
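The preparation steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code used for this report: the records are made up, and the column names `User ID`, `Gender`, `Age`, `EstimatedSalary`, and `Purchased` are assumptions based on the description in the text.

```python
from collections import Counter

# Hypothetical records for illustration; the real data set has 400 rows.
# Column names are assumptions based on the text.
rows = [
    {"User ID": 1, "Gender": "Male", "Age": 19, "EstimatedSalary": 19000, "Purchased": 0},
    {"User ID": 2, "Gender": "Female", "Age": 27, "EstimatedSalary": 120000, "Purchased": 1},
    {"User ID": 3, "Gender": "Female", "Age": 35, "EstimatedSalary": 20000, "Purchased": 0},
    {"User ID": 4, "Gender": "Male", "Age": 46, "EstimatedSalary": 41000, "Purchased": 1},
    {"User ID": 5, "Gender": "Female", "Age": 47, "EstimatedSalary": 25000, "Purchased": 1},
    {"User ID": 6, "Gender": "Male", "Age": 58, "EstimatedSalary": 144000, "Purchased": 1},
    {"User ID": 7, "Gender": "Female", "Age": 32, "EstimatedSalary": 150000, "Purchased": 1},
    {"User ID": 8, "Gender": "Male", "Age": 26, "EstimatedSalary": 43000, "Purchased": 0},
]

# 1. Count missing values (None or empty string) across all fields.
missing = sum(1 for r in rows for v in r.values() if v is None or v == "")

# 2. Rename "EstimatedSalary" to "Salary".
rows = [{("Salary" if k == "EstimatedSalary" else k): v for k, v in r.items()} for r in rows]

# 3. Keep only customers who purchased.
buyers = [r for r in rows if r["Purchased"] == 1]

# 4. Assign each buyer to an age group (bounds are inclusive).
AGE_GROUPS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 60)]

def age_group(age):
    """Return the label of the age group containing `age`, or None."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}~{high}"
    return None  # age falls outside every group

group_counts = Counter(age_group(r["Age"]) for r in buyers)
print(missing, len(buyers), dict(group_counts))
```

Dividing each entry of `group_counts` by `len(buyers)` gives the share of buyers in each age group.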
First, look at the age distribution of buyers. Buyers are concentrated between 25 and 54 years old. The 35 to 44 group accounts for 45%, followed by 25 to 34 at about 24% and 45 to 54 at about 20%. The overall distribution is slightly skewed to the right. From this data, ages 25 to 54 can be considered the buying group, and 45 to 54 the main buying group.
In addition to age, is the income of each age group also a factor affecting purchasing power? The plot shows that people aged 25 to 34 have the highest income, within a narrow range, with a median of 121.5k. Next is the 35 to 44 group, whose income range is somewhat wider, with a median of 108k. The income range of the 45 to 54 group widens noticeably, but its highest value also falls, to below the median of the previous age group. For ages 55 and over, overall income rises and the range widens. An interesting point: the previous plot shows that ages 45 to 54 are the main buying group, yet this group's salaries are lower than those of the other groups. So, in this project, a high salary is not the key factor in buying power.
## The IQR of salary in each age group:
##
## Age 25~34 IQR: 25000
## Age 35~44 IQR: 43500
## Age 45~54 IQR: 74000
## Age 55~60 IQR: 66000
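As a rough illustration of how an interquartile range per age group can be computed, here is a minimal Python sketch. The salary values are hypothetical, and the `inclusive` quantile method is an assumption chosen because it uses linear interpolation between data points; the method behind the figures above is not stated in this report.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical salaries per age group, for illustration only.
salaries_by_group = {
    "25~34": [100000, 110000, 120000, 130000, 140000],
    "45~54": [20000, 30000, 40000, 80000, 100000],
}

for group, salaries in salaries_by_group.items():
    print(f"Age {group} IQR: {iqr(salaries)}")
```

A larger IQR means the middle half of the salaries in that group is spread over a wider interval.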
People usually think that women love shopping more than men. Is this true? The plot below shows no big difference between the numbers of female and male buyers below age 25 or above age 55. In the 45 to 54 group, however, female buyers clearly outnumber male buyers, while in the 35 to 44 group male buyers outnumber female buyers.
Among all buyers, females account for 53% and males for 46%, so female buyers slightly outnumber male buyers.
“The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of the normal distribution. The theorem is shown with various distributions of the input data in the following sections.” As the sample size gets larger, the distribution of the sample means approaches a normal distribution. I apply the central limit theorem to the age variable of the data set. As shown in the bar plot and histogram above, the age distributions of all groups are skewed to the right, so age serves as an example to illustrate the theorem. The histograms below show the means of 1000 random samples for each sample size of 10, 20, 30, and 40; these sample means are approximately normally distributed.
## The Ages means: 46.39161 SD is: 8.612172
## Sample Size = 10 Mean = 46.2188 SD = 2.625578
## Sample Size = 20 Mean = 46.37655 SD = 1.755917
## Sample Size = 30 Mean = 46.3554 SD = 1.379695
## Sample Size = 40 Mean = 46.35725 SD = 1.162401
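The simulation behind output of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used for this report: the population is a made-up, right-skewed set of 200 ages, the seed is arbitrary, and each sample is drawn without replacement (`random.sample`), which may differ from the original procedure.

```python
import random
from statistics import mean, stdev

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "ages" between 18 and 60.
population = [18 + min(random.expovariate(1 / 8), 42) for _ in range(200)]

def sample_means(pop, size, draws=1000):
    """Means of `draws` random samples of `size` items, drawn without replacement."""
    return [mean(random.sample(pop, size)) for _ in range(draws)]

results = {}
for n in (10, 20, 30, 40):
    means = sample_means(population, n)
    results[n] = (mean(means), stdev(means))
    print(f"Sample Size = {n} Mean = {results[n][0]:.3f} SD = {results[n][1]:.3f}")
```

As in the output above, where the standard deviation of the sample means falls from about 2.63 to about 1.16 as the sample size grows from 10 to 40, the sketch shows the mean of the sample means staying close to the population mean while their spread shrinks.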
“A sample is a portion of the population that is selected for doing the data analysis. The results from this sample are then used to estimate the characteristics of the population.” Sampling greatly helps save working time, reduce workload, and reduce the difficulty of analysis. There are many different types of sampling. From the customers who made a purchase, I randomly selected samples of size 20 by age and analyzed them using simple random sampling, simple random sampling without replacement, and systematic sampling. For stratified sampling, given the size of each age group, I drew only one sample from each stratum.
## The mean of Age: 46.39161
## The mean of sample random sampling of Age: 46.20524
## The mean of sample random sampling without Replace of Age: 47.7
## The mean of systematic sampling of Age: 46.64706
## The mean of systematic sampling unequal probability of Age: 48.1
## The mean of stratified sampling of Age: 43.5
The sample means from simple random sampling (46.21) and systematic sampling (46.65) are close to the mean of the data (46.39). The means from simple random sampling without replacement (47.7) and from systematic sampling with unequal inclusion probabilities based on the age variable (48.1) are larger than the mean of the data. The stratified sample mean (43.5) is lower than the mean of the data, which may be caused by the limited number of buyers in each age stratum.
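The equal-probability sampling schemes discussed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the ages are made up, the seed is arbitrary, the population is sorted before systematic sampling, and the unequal-probability variants are not covered. It is not the code used for this report.

```python
import random
from statistics import mean

random.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical buyer ages, sorted so systematic sampling walks through them in order.
ages = sorted(random.randint(18, 60) for _ in range(140))
n = 20

# Simple random sampling with replacement: an age can be drawn more than once.
srs_with = random.choices(ages, k=n)

# Simple random sampling without replacement: each record is drawn at most once.
srs_without = random.sample(ages, n)

# Systematic sampling: random start, then every k-th record.
k = len(ages) // n
start = random.randrange(k)
systematic = ages[start::k][:n]

# Stratified sampling: one record drawn at random from each age group.
AGE_GROUPS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 60)]

def age_group(age):
    """Return the (low, high) bounds of the group containing `age`."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return (low, high)
    return None

strata = {}
for a in ages:
    strata.setdefault(age_group(a), []).append(a)
stratified = [random.choice(members) for members in strata.values()]

for name, sample in [("population", ages), ("SRS with replacement", srs_with),
                     ("SRS without replacement", srs_without),
                     ("systematic", systematic), ("stratified", stratified)]:
    print(f"{name}: n = {len(sample)}, mean = {mean(sample):.2f}")
```

With only one record per stratum, the stratified sample has at most five values and gives every age group equal weight regardless of its size, so its mean can drift well away from the population mean.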
Going back to the original question, I compare the age, gender, and income of customers who made purchases to see how these factors relate to one another.
## The correlation coefficient between Age and Salary is: -0.3690264
The plot shows that age and salary are related, while gender shows little relation to either. The calculated correlation between age and salary is -0.369, so age and income have a moderate negative relationship: among buyers, salary tends to be lower at higher ages.
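For reference, a correlation coefficient of this kind can be computed as in the following Python sketch. It assumes the reported value is a Pearson correlation, which the text does not state explicitly, and it uses made-up age and salary pairs rather than the survey data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (age, salary) pairs for illustration only.
ages = [25, 30, 35, 40, 45, 50]
salaries = [120000, 115000, 100000, 105000, 80000, 60000]

r = pearson(ages, salaries)
print(f"The correlation coefficient between Age and Salary is: {r:.4f}")
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one; the sign gives the direction.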
In summary, among all buyers, middle-aged and elderly people with lower incomes are the keenest shoppers, and gender makes little difference in shopping behavior. The recommendation is therefore to focus on middle-aged and elderly customers, sell products suited to both the lower and higher income categories, and offer products for both women and men. That will be a huge market.